[server][dvc] change the client side transfer timeout configurable and close channel once timeout. #1805

jingy-li · 2025-05-15T21:15:30Z

Problem Statement

When onboarding the blob transfer bootstrap feature to a large store (e.g., 10GB per partition, 120GB per host), the transfer time is so long that it triggers a client-side timeout exception. Upon reaching the timeout, a partition cleanup is performed before moving to the next host.

However, during the cleanup process, the channels are not closed, and Netty continues receiving transferred files. If files are deleted while validation is still running, checksum validation fails. These failures trigger the exceptionCaught method, which eventually closes the channel.

As a result, the cleanup is incomplete: some files are deleted, but others that are still being transferred, or that are created after the cleanup begins, remain. This race condition arises because file transfers and cleanups happen concurrently.

Ultimately, even if the blob transfer fails and the bootstrap falls back to Kafka ingestion, the incomplete cleanup leads to database corruption due to residual files.

Solution

Allow the client-side timeout to be configurable. Previously, only the server-side timeout was configurable.
Close the channel upon timeout to prevent continued file reception.
Reduce the server-side active user count when the channel becomes inactive. This prevents the server from maintaining an active connection count if the connection is unexpectedly interrupted.
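The ordering in the first two points (stop receiving first, clean up second) can be sketched with plain JDK types. This is a minimal illustration, not the code in this PR: `FakeChannel`, `TimedTransfer`, and `awaitTransfer` are hypothetical names, the channel is a stand-in rather than a real Netty channel, and the timeout is passed as a parameter where the real code would presumably read it from a config.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for a Netty channel; it only tracks whether it was closed. */
class FakeChannel implements AutoCloseable {
  private final AtomicBoolean open = new AtomicBoolean(true);

  boolean isOpen() {
    return open.get();
  }

  @Override
  public void close() {
    open.set(false);
  }
}

/** Hypothetical helper showing "close the channel, then clean up" on timeout. */
class TimedTransfer {
  /**
   * Waits up to the given timeout for the transfer to finish.
   * If the wait times out, fails, or is interrupted, the channel is closed
   * BEFORE cleanup runs, so no new bytes can arrive while files are deleted.
   * Returns true only if the transfer completed normally.
   */
  static boolean awaitTransfer(
      CompletableFuture<Void> transfer,
      FakeChannel channel,
      Duration timeout,
      Runnable cleanup) {
    boolean completed = false;
    try {
      transfer.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
      completed = true;
    } catch (TimeoutException | ExecutionException e) {
      // Fall through to close + cleanup below.
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt flag
    }
    if (!completed) {
      channel.close(); // 1. stop receiving
      cleanup.run();   // 2. only then remove partial files
    }
    return completed;
  }

  public static void main(String[] args) {
    FakeChannel channel = new FakeChannel();
    CompletableFuture<Void> neverCompletes = new CompletableFuture<>();
    boolean done = awaitTransfer(
        neverCompletes,
        channel,
        Duration.ofMillis(50),
        () -> System.out.println("cleanup ran, channel open = " + channel.isOpen()));
    System.out.println("completed = " + done);
  }
}
```

Because the channel is closed before the cleanup callback runs, the cleanup never races with an in-flight file write, which is the failure mode described in the problem statement.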

Code changes

Added new code behind a config. If so, list the config names and their default values in the PR description.
Introduced new log lines.
- Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

Code has no race conditions or thread safety issues.
Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

Verified the reduction of the server-side active user count through integration tests and unit tests.
The timeout should be tested at the host E2E level, as integration tests complete too quickly (in seconds) to reach a timeout set at the minute level.

New unit tests added.
New integration tests added.
Modified or extended existing tests.
Verified backward compatibility (if applicable).

Does this PR introduce any user-facing or breaking changes?

No. You can skip the rest of this section.
Yes. Clearly explain the behavior change and its impact.

gaojieliu

We should add a test that simulates concurrent blob transfer requests to make sure the global limit works.
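A rough sketch of what such a test could exercise, using only JDK types. It assumes a single shared counter that is checked and bumped atomically, rejects requests at the limit (the later commits mention a 429 response for that case), and is decremented when a request finishes or its channel goes inactive. `TransferSlots`, `tryAcquire`, `release`, and `activeCount` are hypothetical names, not the handler's actual API.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical global limiter for concurrent blob transfer requests. */
class TransferSlots {
  private final int maxConcurrent;
  private final AtomicInteger active = new AtomicInteger();

  TransferSlots(int maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
  }

  /**
   * Checks the limit and bumps the counter in one atomic step.
   * Returns false when the limit is reached; a server would answer 429 here.
   */
  boolean tryAcquire() {
    while (true) {
      int current = active.get();
      if (current >= maxConcurrent) {
        return false;
      }
      if (active.compareAndSet(current, current + 1)) {
        return true;
      }
    }
  }

  /** Called when a request completes or its channel becomes inactive; never goes below zero. */
  void release() {
    active.updateAndGet(n -> n > 0 ? n - 1 : 0);
  }

  int activeCount() {
    return active.get();
  }

  public static void main(String[] args) throws InterruptedException {
    TransferSlots slots = new TransferSlots(3);
    ExecutorService pool = Executors.newFixedThreadPool(8);
    AtomicInteger accepted = new AtomicInteger();
    CountDownLatch done = new CountDownLatch(20);
    // 20 simultaneous requests against a limit of 3; nobody releases a slot.
    for (int i = 0; i < 20; i++) {
      pool.execute(() -> {
        if (slots.tryAcquire()) {
          accepted.incrementAndGet();
        }
        done.countDown();
      });
    }
    done.await();
    pool.shutdown();
    System.out.println("accepted = " + accepted.get() + ", active = " + slots.activeCount());
  }
}
```

Doing the check and the increment in one compare-and-set loop keeps two racing requests from both passing the check at `maxConcurrent - 1`, which a separate "read, then increment" would allow.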

gaojieliu

LGTM, thanks!

[server][dvc] change the client side transfer timeout configurable an…

42c154a

…d close channel once timeout.

jingy-li requested review from sixpluszero and gaojieliu May 15, 2025 21:15

Add a unit test to verify that the concurrent user count is reduced w…

3000c3c

…hen the channel becomes inactive.

gaojieliu reviewed May 16, 2025

...nts/da-vinci-client/src/main/java/com/linkedin/davinci/blobtransfer/BlobSnapshotManager.java Outdated

...i-client/src/main/java/com/linkedin/davinci/blobtransfer/client/NettyFileTransferClient.java Outdated

Address code review 1: 1. remove client side timeout 2. set global co…

b939316

…unter

jingy-li requested a review from gaojieliu May 19, 2025 18:55

jingy-li added 2 commits May 19, 2025 17:15

Fix a bug from previously remove snapshot generation from EOP

1091c9d

fix unit test

e77d688

gaojieliu reviewed May 20, 2025

...ent/src/main/java/com/linkedin/davinci/blobtransfer/server/P2PFileTransferServerHandler.java

...ent/src/main/java/com/linkedin/davinci/blobtransfer/server/P2PFileTransferServerHandler.java Outdated

jingy-li added 3 commits May 20, 2025 14:49

Revert "Fix a bug from previously remove snapshot generation from EOP"

8a28cfe

This reverts commit 1091c9d.

Revert "fix unit test"

cb60c0d

This reverts commit e77d688.

address code review 2: 1. Add unit test to veriy the 429 error. 2. ch…

10abce8

…ange initializer use same handler per connection 3. Do check max concurrent user when bump it

jingy-li requested a review from gaojieliu May 21, 2025 02:14

fix spotbugsTest/spotbugsMain

8f06310

gaojieliu approved these changes May 21, 2025

jingy-li merged commit 615a5de into linkedin:main May 21, 2025
59 checks passed

jingy-li deleted the fix-cert0-large-store-timeout-bug branch May 21, 2025 21:18
